P. R. ROSENBAUM

A sensitivity analysis in an observational study determines the
magnitude of bias from nonrandom treatment assignment that would
need to be present to alter the qualitative conclusions of a naive
analysis that presumes all biases were removed by matching or by other
analytic adjustments. The power of a sensitivity analysis and the
design sensitivity anticipate the outcome of a sensitivity analysis under
an assumed model for the generation of the data. It is known that
the power of a sensitivity analysis is affected by the choice of test
statistic, and, in particular, that a statistic with good Pitman
efficiency in a randomized experiment, such as Wilcoxon's signed rank
statistic, may have low power in a sensitivity analysis and low
design sensitivity when compared to other statistics. For instance, for
an additive treatment effect and errors that are Normal or logistic
or t-distributed with 3 degrees of freedom, Brown's combined
quantile average test has Pitman efficiency close to that of Wilcoxon's
test but has higher power in a sensitivity analysis, while a version
of Noether's test has poor Pitman efficiency in a randomized
experiment but much higher design sensitivity so it is vastly more powerful
than Wilcoxon's statistic in a sensitivity analysis if the sample size is
sufficiently large. A new exact distribution-free test is proposed that
rejects if either Brown's test or Noether's test rejects after
adjusting the two critical values so the overall level of the combined test
remains at $\alpha$, conventionally $\alpha = 0.05$. In every sampling situation,
the design sensitivity of the adaptive test equals the larger of the two
design sensitivities of the component tests. The adaptive test exhibits
good power in sensitivity analyses asymptotically and in simulations.
In one sampling situation (Normal errors and an additive effect that
is three-quarters of the standard deviation with 500 matched pairs)
the power of Wilcoxon's test in a sensitivity analysis was 2% and
the power of the adaptive test was 87%. A study of treatments for
ovarian cancer in the Medicare population is discussed in detail.

1. Introduction: Motivation; example; outline. 

1.1. Are large observational studies less susceptible to unmeasured biases? 
There is certainly a sense in which large observational studies are more — not 
less — susceptible to unmeasured biases than smaller studies. Biases due to 
nonrandom treatment assignment generally do not become smaller as the 
sample size increases. These biases are due to the failure to control some un- 
measured covariate that would have been balanced by random assignment 
of treatments. If a large observational study is analyzed naively under the 
assumption that adjustments for measured covariates have, in effect, trans- 
formed the study into a randomized experiment, then as the sample size 
increases even very small biases due to unmeasured covariates can seriously 
distort the level of significance tests and the coverage of confidence intervals; 
see Cochran (1965), Section 3.1. 

Suppose, however, that the analysis takes explicit account of uncertainty 
about unmeasured biases by performing a sensitivity analysis. Is a large 
sample size of any assistance in this case? It is known that the degree of sen- 
sitivity to unmeasured biases is affected by many aspects of the design and 
analysis of an observational study [Rosenbaum (2004, 2010b)], but the rele- 
vant decisions about design and analysis are often difficult to make without 
guidance from empirical data. Heller, Rosenbaum and Small (2009) found 
that sample splitting — sacrificing a small portion, say, 10%, of the sample to 
guide design and analysis — could, in favorable circumstances, yield reduced 
sensitivity to unmeasured biases by guiding the needed decisions. Sample 
splitting has the advantage, emphasized by Cox (1975), of permitting reflec- 
tion and judgement in light of data without invalidating the formal proper- 
ties of statistical procedures. However, some questions, such as the thickness 
of the tails of distributions, are difficult to settle using a small fraction of the 
sample, and may require guidance from the complete sample. Here, an adap- 
tive test is proposed that chooses between two tests with different properties, 
and in one sense achieves the performance of the better test in large sam- 
ples; see Proposition 1 in Section 4.3. Although motivated by large sample 
calculations, the adaptive procedure performs well in simulations in samples 
as small as 100 matched pairs. 

1.2. Example: Is more chemotherapy for ovarian cancer more effective? 
Following surgery to remove a visible tumor, the typical reason that one can- 
cer patient receives more chemotherapy than another is that their cancers 
differ in localization or recurrence. A straightforward comparison of patients
receiving more or less chemotherapy is likely to be biased by comparing sicker
patients to healthier ones. Is there a better comparison? Ovarian cancer is 
unusual in this regard, because there is a source of variation in the inten- 
sity of chemotherapy that is not a reaction to the patient and her illness. 
Chemotherapy for ovarian cancer may be provided by either a medical on- 
cologist who treats cancers of all kinds or by a gynecological oncologist who 
treats cancers of the ovary, uterus and cervix. Medical oncologists (MOs) 
and gynecological oncologists (GOs) differ in both training and practice. 
In particular, GOs are gynecologists, and hence surgeons, perhaps the best 
surgeons for gynecological cancers, and they often perform surgery for ovar- 
ian cancer, whereas MOs are almost invariably not surgeons and administer 
chemotherapy after someone else, perhaps a general surgeon, a gynecologist 
or GO, has performed surgery. Typically, an MO had a residency in internal 
medicine followed by a 3-year fellowship in oncology emphasizing the use 
of chemotherapy, whereas a GO had a residency in obstetrics and gynecol- 
ogy followed by a fellowship in gynecologic oncology with attention paid to 
surgical treatment of ovarian cancer. Silber et al. (2007) hypothesized cor- 
rectly that MOs would use chemotherapy more intensively than GOs, and 
they used this difference in intensity to ask whether more chemotherapy is 
of benefit to the patient. 

Using data from Medicare and the Surveillance, Epidemiology and End 
Results (SEER) program of the U.S. National Cancer Institute, Silber et al. 
(2007) looked at patients with ovarian cancer between 1991 and 2001 who 
received chemotherapy after appropriate surgery; see their paper for details 
of the patient population. They matched all $I = 344$ such ovarian cancer
patients treated by a gynecologic oncologist to 344 ovarian cancer patients 
treated by a medical oncologist. Using the matching algorithm of Rosen- 
baum, Ross and Silber (2007), the matching controlled for 36 covariates, 
including clinical stage, tumor grade, surgeon type, comorbid conditions 
such as diabetes and congestive heart failure, age, race and year of diagnosis 
[Silber et al. (2007), Tables 2 and 3]. Importantly, the duration of follow-up 
was virtually identical in the two groups. On average, during the five years 
after diagnosis, the patients of medical oncologists received about four more 
weeks of chemotherapy, with MO patients receiving on average 16.5 weeks 
of chemotherapy and GO patients receiving 12.1 weeks. The upper portion 
of Figure 1 is a pair of two quantile-quantile plots [Wilk and Gnanadesikan 
(1968)] of weeks of chemotherapy in the first year or the first five years for 
the 344 GO patients and the 344 MO patients, momentarily ignoring who 
is matched to whom. Because the points lie above the line of equality, the 
distribution of chemotherapy weeks for MO patients appears to be stochas- 
tically larger than the distribution for GO patients. Survival was virtually 
identical with nearly identical Kaplan-Meier survival curves that crossed 
repeatedly, and a median survival of 2.98 years in the MO group and 3.04

Fig. 1. Four quantile-quantile plots of weeks with either chemotherapy or toxicity for
$I = 344$ pairs of an MO and a GO patient. Quantile-quantile plots describe the marginal
distributions, ignoring who is paired with whom.



years in the GO group [Silber et al. (2007), Figure 1 and Table 1]. Patients of 
medical oncologists experienced more weeks with chemotherapy associated 
side effects or toxicity, such as anemia, neutropenia, thrombocytopenia and 
drug induced neuropathy, on average over five years, 16.2 weeks for MOs and 
8.9 weeks for GOs; see the bottom half of Figure 1. If Wilcoxon's signed rank
test is used to compare weeks with toxicity in matched pairs, the P-values
are less than 10~^ for both year one and the first five years, but of course 
those P-values take no account of possible biases in this nonrandomized 
comparison. In brief, greater intensity of chemotherapy was not associated 
with longer survival, but it was associated with more frequent side effects. 
The study generated some discussion, in particular, an editorial, five
letters discussing either the study or the editorial, and two rejoinders, one from
the authors of the paper and one from the author of the editorial, or 11 pages
of published discussion of a 7 page paper. Happily, matching for 36 measured
covariates was convincing in the very limited sense such adjustments can be 
convincing: none of the discussion expressed continued concern about these 
36 measured covariates, which include many of the key covariates for ovarian 
cancer. Virtually all of the discussion concerned possible unmeasured biases, 
possible ways the MO and GO groups may differ besides the 36 measured 
covariates. The editorial by an MO mentioned the magnitude of "residual 
tumor" not removed by surgery, a covariate not recorded in SEER, and sug- 
gested the possibility that GOs were less prone to notice toxicity, whereas the 
first letter by two GOs characterized these comments as "spinning a tale." 
The particulars of the discussion had strengths and weaknesses, but, in an 
abstract sense, a concern about possible unmeasured biases is reasonable in 
most if not all observational studies, and in that sense the discussion was 
constructively focused on the central issue. A disappointing feature of the 
11 pages of discussion was that it contained little in the way of data, quanti- 
tative analysis or evidence, although there was a little data in one rejoinder. 
A sensitivity analysis in an observational study is an attempt to return to 
the data and to quantitative analysis when discussing the possible impact 
of unmeasured biases. 

How large would the departure from random assignment have to be to 
alter the conclusions? The answer is determined by a sensitivity analysis. 
The degree of sensitivity to unmeasured biases in this study is noticeably 
affected by the choice of test statistic; see Section 5. Theoretical considera- 
tions suggest that certain statistics, for instance, Wilcoxon's statistic, tend 
to exaggerate the degree of sensitivity to unmeasured biases, at least for 
additive treatment effects with symmetric errors [Rosenbaum (2010a)], so 
perhaps certain methods may be excluded on purely theoretical grounds. On 
the other hand, many issues affect the degree of sensitivity to bias reported 
by different test statistics [Rosenbaum (2010b), Part III], and some of these
issues are difficult to evaluate prior to looking at the data. Here, an exact, 
adaptive test is proposed that chooses, after the fact, the less sensitive of 
two analyses, exactly correcting the level of the test for the use of two anal- 
yses. Is adapting the test statistic to the data at hand of value in sensitivity 
analyses? 

1.3. Outline: Review; an exact adaptive test; design sensitivity; power. 
Section 2 is a review of existing background material and notation, including 
randomization inference in experiments in Section 2.1, sensitivity analysis in 
observational studies in Section 2.2, and the power of a sensitivity analysis 
and the design sensitivity in Section 2.3; there is little new material in the 
review in Section 2. With notation and background established, Section 3
discusses why adaptation is important in this context. The new adaptive
test is discussed in Section 4, its exact null distribution in Sections 4.1-4.2,
its nonnull asymptotic properties in Section 4.3, and its finite sample power
obtained by simulation in Section 4.4. In particular, Proposition 1 of Section
4.3 shows that in each sampling situation, the design sensitivity of the
adaptive procedure is equal to the maximum of the design sensitivities of the 
two nonadaptive procedures from which it is built. The simulation suggests 
that the asymptotic properties begin to take effect in samples of modest size. 
In Section 5 the methods are applied to the example in Section 1.2 from Sil- 
ber et al. (2007). The discussion in Section 6 considers related alternative 
methods in Section 6.1 and returns in Section 6.2 to the question raised in 
Section 1.1. 

2. Notation and review: Randomization; sensitivity analysis; design sen- 
sitivity. 

2.1. Inference about treatment effects in a randomized experiment. There
are $I$ matched pairs, $i = 1, \ldots, I$, of two subjects, $j = 1, 2$, one
treated with $Z_{ij} = 1$, the other control with $Z_{ij} = 0$, so
$Z_{i1} + Z_{i2} = 1$ for each $i$. The subjects were matched for an observed
covariate, $x_{ij}$, so $x_{i1} = x_{i2}$ for each $i$, but they may differ in
terms of an unobserved covariate $u_{ij}$, so possibly $u_{i1} \neq u_{i2}$.
Subject $ij$ has two potential responses, namely, $r_{Tij}$ if $ij$ is
assigned to treatment with $Z_{ij} = 1$ and $r_{Cij}$ if $ij$ is assigned to
control with $Z_{ij} = 0$, so the response observed from $ij$ is
$R_{ij} = Z_{ij} r_{Tij} + (1 - Z_{ij}) r_{Cij}$ and the effect of the
treatment $r_{Tij} - r_{Cij}$ on subject $ij$ is not observed for any
subject; see Neyman (1923), Welch (1937), Rubin (1974), Reiter (2000) and
Gadbury (2001). Fisher's (1935) sharp null hypothesis $H_0$ of no treatment
effect asserts $H_0 : r_{Tij} = r_{Cij}$, for $i = 1, \ldots, I$, $j = 1, 2$,
whereas the hypothesis $H_\tau$ of an additive constant treatment effect
$\tau$ asserts $H_\tau : r_{Tij} = r_{Cij} + \tau$ for all $ij$.

Write $\mathcal{F} = \{(r_{Tij}, r_{Cij}, x_{ij}, u_{ij}), i = 1, \ldots, I, j = 1, 2\}$
for the potential responses and covariates and write $\mathcal{Z}$ for the
event that $\{Z_{i1} + Z_{i2} = 1, i = 1, \ldots, I\}$. In a randomized paired
experiment, one subject in each pair $i$ is picked at random to receive the
treatment, so $\Pr(Z_{ij} = 1 \mid \mathcal{F}, \mathcal{Z}) = 1/2$ for
each $ij$, with independent assignments in distinct pairs.

Within pair $i$, the treated-minus-control difference $Y_i$ in observed
responses is

$$Y_i = (Z_{i1} - Z_{i2})(R_{i1} - R_{i2}) = Z_{i1}(r_{Ti1} - r_{Ci2}) + Z_{i2}(r_{Ti2} - r_{Ci1})$$

$$= \tau + \varepsilon_i \quad \text{with } \varepsilon_i = (Z_{i1} - Z_{i2})(r_{Ci1} - r_{Ci2}) \text{ if } H_\tau \text{ is true.}$$

Given $\mathcal{F}$, $\mathcal{Z}$, the quantity $r_{Ci1} - r_{Ci2}$ is fixed, and in a
randomized experiment $\varepsilon_i = \pm|r_{Ci1} - r_{Ci2}|$ with equal
probabilities $1/2$, so if $H_\tau$ is true, then $Y_i$ is symmetric about $\tau$.

Ties among the $Y_i$'s are not a problem, but the development is simpler
if ties of all kinds are assumed absent. In particular, when testing $H_\tau$, the
$|Y_i - \tau|$ are assumed to be untied, and the $Y_i - \tau$ are assumed to not equal
zero. Minor adjustments in Section 4.5 eliminate these restrictions.

When testing $H_\tau$, let $q_i$ be the rank of $|Y_i - \tau|$ and let $s_i = 1$
if $Y_i - \tau > 0$ or $s_i = 0$ if $Y_i - \tau < 0$; then Wilcoxon's signed rank
statistic is $W = \sum_{i=1}^{I} s_i q_i$, where the $q_i$ are a permutation of
$1, 2, \ldots, I$. Conditionally given $\mathcal{F}$, $\mathcal{Z}$, if $H_\tau$
is true in a randomized paired experiment, then $Y_i - \tau = \varepsilon_i$ is
$\pm|r_{Ci1} - r_{Ci2}|$ with equal probabilities $1/2$, so $q_i$ is fixed and
$s_i = 1$ or $0$ with equal probabilities $1/2$, and Wilcoxon's statistic has the
distribution of the sum of $I$ independent random variables, $i = 1, \ldots, I$,
taking the values $i$ or $0$ with equal probabilities $1/2$. This null
distribution is the basis for testing $H_\tau$, and by inverting
the test it yields confidence intervals and Hodges-Lehmann point estimates
for an additive treatment effect $\tau$. See Lehmann (1975) for discussion of
these standard techniques and for discussion of the good performance of
Wilcoxon's statistic when applied in randomized experiments. See Maritz
(1979) for a parallel development of randomization inferences using Huber's
m-estimates including the permutational t-test.
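
The null distribution just described (a sum of $I$ independent terms, the $i$th
equal to $i$ or $0$ with probability $1/2$ each) can be tabulated exactly by
convolution. The following Python sketch is illustrative only; the function
names are hypothetical and are not taken from the paper.

```python
def signed_rank_null_pmf(n_pairs):
    """Return pmf with pmf[w] = Pr(W = w) under the randomization null.

    W is the sum over ranks 1..n_pairs of rank * s, where each sign
    indicator s is 0 or 1 with probability 1/2, independently.
    """
    max_w = n_pairs * (n_pairs + 1) // 2
    pmf = [0.0] * (max_w + 1)
    pmf[0] = 1.0
    for rank in range(1, n_pairs + 1):
        updated = [0.0] * (max_w + 1)
        for w, prob in enumerate(pmf):
            if prob == 0.0:
                continue
            updated[w] += 0.5 * prob          # sign indicator is 0
            updated[w + rank] += 0.5 * prob   # sign indicator is 1
        pmf = updated
    return pmf


def one_sided_p_value(pmf, w_observed):
    """Pr(W >= w_observed) under the tabulated null distribution."""
    return sum(pmf[w_observed:])


pmf_5 = signed_rank_null_pmf(5)
p_all_positive = one_sided_p_value(pmf_5, 15)  # all five differences positive
```

With five untied pairs there are $2^5 = 32$ equally likely sign patterns, so
the largest possible value $W = 15$ has one-sided P-value $1/32$.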

2.2. Sensitivity analysis in observational studies. In the absence of
randomization, there is no basis for assuming that
$\Pr(Z_{ij} = 1 \mid \mathcal{F}, \mathcal{Z}) = 1/2$ and
therefore no basis beyond naivete for assuming that the inferences in
Section 2.1 are correct. A sensitivity analysis in an observational study asks
how the conclusions in Section 2.1 would change in response to departures
from $\Pr(Z_{ij} = 1 \mid \mathcal{F}, \mathcal{Z}) = 1/2$ of various magnitudes.
One model for sensitivity analysis [Rosenbaum (2002a), Section 4] begins by
assuming that in the population before matching treatment assignments are
independent with unknown probabilities $\pi_{ij} = \Pr(Z_{ij} = 1 \mid \mathcal{F})$,
and two subjects, say, $ij$ and $ij'$, with the same observed covariates,
$x_{ij} = x_{ij'}$, may differ in their odds of treatment by at most a factor
of $\Gamma \ge 1$,

(1)   $\frac{1}{\Gamma} \le \frac{\pi_{ij}(1 - \pi_{ij'})}{\pi_{ij'}(1 - \pi_{ij})} \le \Gamma$   if $x_{ij} = x_{ij'}$;

then a distribution of treatment assignments for matched pairs is obtained
by conditioning on the event $\mathcal{Z}$. This is easily seen to be equivalent to
assuming that

(2)   $\frac{1}{1 + \Gamma} \le \Pr(Z_{i1} = 1 \mid \mathcal{F}, \mathcal{Z}) \le \frac{\Gamma}{1 + \Gamma}$,   $Z_{i2} = 1 - Z_{i1}$,   $i = 1, \ldots, I$,

with independent assignments in distinct pairs; see Rosenbaum (2002a),
Section 4. To aid interpretation, the one parameter $\Gamma$ may be unpacked into
two parameters, one $\Lambda$ controlling the relationship between treatment
assignment $Z_{ij}$ and the unobserved covariate $u_{ij}$, the other $\Delta$
controlling the relation between response $r_{Cij}$ and $u_{ij}$, yielding the
same one-dimensional analysis in terms of $\Gamma$ but for all $(\Lambda, \Delta)$
that solve $\Gamma = (\Lambda\Delta + 1)/(\Lambda + \Delta)$
[Rosenbaum and Silber (2009)]; for instance, $\Gamma = 1.25$ corresponds with an
unobserved covariate that simultaneously doubles the odds of treatment,
$Z_{i1} - Z_{i2} = 1$, and doubles the odds of a positive response difference under
control, $r_{Ci1} - r_{Ci2} > 0$, as $1.25 = (2 \times 2 + 1)/(2 + 2)$. In this
formulation, the parameter $\Delta$ is defined using Wolfe's (1974) semiparametric
family of asymmetric deformations of a symmetric distribution to place a bound on
the distribution of $r_{Ci1} - r_{Ci2}$; see Rosenbaum and Silber (2009) for specifics.
Either (1) or (2) says that treatment assignment probabilities are unknown
but to a bounded degree determined by $\Gamma$. For each fixed $\Gamma \ge 1$, (2) yields
an interval of possible values of an inference quantity, such as a P-value or
point estimate or the endpoint of a confidence interval, and a sensitivity
analysis consists in computing that interval for several values of $\Gamma$, thereby
indicating the magnitude of departure from randomization that would need
to be present to alter the conclusions of the analysis in Section 2.1. For
instance, for $\Gamma = 1$ the interval of one-sided P-values from Wilcoxon's test
is a single point, namely, the P-value from the randomization test in
Section 2.1, but as $\Gamma \to \infty$ the interval tends to $[0, 1]$; that is,
association does not logically imply causation. The practical question is: how
large must $\Gamma$ be before the interval of P-values is inconclusive, say,
including values both above and below a conventional level such as 0.05?
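
As a small numerical aid, the bounds in (2) and the amplification
$\Gamma = (\Lambda\Delta + 1)/(\Lambda + \Delta)$ can be evaluated directly.
The Python sketch below is illustrative; the function names are hypothetical,
and the inverse map is simply the algebraic solution of the amplification
formula for $\Delta$, valid when $\Lambda > \Gamma$.

```python
def assignment_probability_bounds(gamma):
    """Bounds in (2) on Pr(Z_i1 = 1 | F, Z) for a given gamma >= 1."""
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    return 1.0 / (1.0 + gamma), gamma / (1.0 + gamma)


def amplify(lam, delta):
    """Gamma implied by the pair (Lambda, Delta)."""
    return (lam * delta + 1.0) / (lam + delta)


def delta_given(gamma, lam):
    """Solve gamma = (lam * delta + 1) / (lam + delta) for delta.

    Rearranging gives delta = (gamma * lam - 1) / (lam - gamma),
    which requires lam > gamma.
    """
    if lam <= gamma:
        raise ValueError("Lambda must exceed Gamma")
    return (gamma * lam - 1.0) / (lam - gamma)


gamma_example = amplify(2.0, 2.0)            # the text's example: 1.25
low, high = assignment_probability_bounds(2.0)
```

For $\Gamma = 2$ the bounds are $1/3$ and $2/3$, and `delta_given(1.25, 2.0)`
recovers $\Delta = 2$, matching the example in the text.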

Various methods of sensitivity analysis in observational studies are dis- 
cussed by Cornfield et al. (1959), Copas and Eguchi (2001), Diprete and 
Gangl (2004), Egleston, Scharfstein and MacKenzie (2009), Frangakis and 
Rubin (1999), Gastwirth (1992), Gilbert, Bosch and Hudgens (2003), Hos- 
man, Hansen and Holland (2010), Imbens (2003), Lin, Psaty and Kronmal 
(1998), Marcus (1997), McCandless, Gustafson and Levy (2007), Rosen- 
baum and Rubin (1983), Small (2007), Wang and Krieger (2006), Yanagawa 
(1984), Yu and Gastwirth (2005), among others. 

The discussion has emphasized adjustments for observed covariates by 
matching, as opposed to, say, covariance adjustment. In simulations, Rubin 
(1979) found that model-based adjustments without matching are not ro- 
bust to model misspecification, sometimes increasing rather than reducing 
bias from measured covariates, but he found that model-based adjustments 
of matched pair differences are robust to model misspecification. The meth- 
ods described in the current paper may be applied to residuals of covari- 
ance adjustment of matched pair differences using the device in Rosenbaum 
(2002b), Section 5. Also, the sensitivity model (1) is applicable to a wide 
variety of situations, including binary outcomes and censored survival times 
[Rosenbaum (2002a), Section 4].

2.3. Design sensitivity in observational studies. If an observational study
were free of unmeasured bias, then we could not determine this from the
observable data, and the best we could hope to say is that the conclusions are
insensitive to small and moderate biases. The power of a sensitivity analysis
is the probability that we will be able to say this [Rosenbaum (2004)]. The
power of a randomization test anticipates the outcome of such a test under
an assumed model for data generation in a randomized trial. In parallel, the
power of a sensitivity analysis with a specific $\Gamma$ anticipates the outcome of
a sensitivity analysis when performed on data from an assumed model for
data generation. In the favorable situation, the data reflect a treatment effect
and no bias from unmeasured covariates, and it is in this situation that we
hope to report insensitivity to unmeasured bias. For instance, we might ask
the following: if the $I$ matched pair differences were produced by an additive
constant treatment effect $\tau$ with no bias and Normal errors,
$Y_i = \tau + \varepsilon_i$ with $\varepsilon_i \sim_{\mathrm{i.i.d.}} N(0, \sigma^2)$,
then, under this model, what is the probability that the
entire interval of P-values testing $H_0$ is below 0.05 when computed with, say,
$\Gamma = 2$? For Wilcoxon's test with $I = 100$ and $\tau/\sigma = 1/2$, the entire
interval of P-values computed with $\Gamma = 2$ is less than 0.05 with probability
0.54, so there is a reasonable chance that an effect of this magnitude will be
judged insensitive to a moderately large bias of $\Gamma = 2$. In contrast, the new
adaptive test proposed in the current paper has power of 0.68 in this same
situation, a substantial improvement.

As $I \to \infty$, there is a value $\tilde{\Gamma}$ called the design sensitivity
[Rosenbaum (2004)] such that the power of a sensitivity analysis tends to 1 if the
analysis is performed with $\Gamma < \tilde{\Gamma}$ and tends to zero if the
analysis is performed with $\Gamma > \tilde{\Gamma}$. That is, the power, viewed as
a function of $\Gamma$, is tending to a step function with a single step down from
1 to 0 at $\Gamma = \tilde{\Gamma}$; see Rosenbaum (2010b), Figure 14.3. In the
limit, as the sample size increases, data generated by a certain model without
bias will be insensitive to biases smaller than $\tilde{\Gamma}$ and sensitive to
biases larger than $\tilde{\Gamma}$. For instance, if $Y_i = \tau + \varepsilon_i$
with $\varepsilon_i \sim_{\mathrm{i.i.d.}} N(0, \sigma^2)$ and $\tau/\sigma = 1/2$,
the design sensitivity for Wilcoxon's statistic is $\tilde{\Gamma} = 3.17$, so for
sufficiently large $I$ it is virtually certain that Wilcoxon's statistic will
report insensitivity to a bias of $\Gamma$ if $\Gamma < 3.17$ and virtually certain
it will report sensitivity to a bias of $\Gamma$ if $\Gamma > 3.17$. In contrast, in
this same sampling situation, the new adaptive test proposed in the current
paper has design sensitivity $\tilde{\Gamma} = 4.97$, again a substantial
improvement. In particular, in this sampling situation as $I \to \infty$, the
power of a sensitivity analysis performed at $\Gamma = 4$ is tending to zero for
Wilcoxon's test and to one for the new adaptive test.
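
The value 3.17 can be checked numerically. The sketch below assumes the closed
form $\tilde{\Gamma} = p/(1 - p)$ with $p = \Pr(Y_i + Y_{i'} > 0)$ for two
independent pair differences, a formula recalled from Rosenbaum (2004) rather
than stated in this section, so it should be read as an assumption. Under
Normal errors, $Y_i + Y_{i'} \sim N(2\tau, 2\sigma^2)$, so
$p = \Phi(\sqrt{2}\,\tau/\sigma)$. The function names are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def wilcoxon_design_sensitivity(effect_over_sd):
    """Assumed closed form p / (1 - p), p = Pr(Y_i + Y_i' > 0), Normal errors."""
    p = normal_cdf(sqrt(2.0) * effect_over_sd)
    return p / (1.0 - p)


gamma_tilde = wilcoxon_design_sensitivity(0.5)
```

With $\tau/\sigma = 1/2$ this evaluates to about 3.17, in agreement with the
value quoted in the text; with no effect it equals 1.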

Design sensitivity has been described in terms of the power of tests, but 
parallel issues arise in conducting a sensitivity analysis for a confidence 
interval or a point estimate. In a randomized experiment, a test such as 
Wilcoxon's test may be inverted to yield a confidence interval or a
Hodges-Lehmann point estimate of an additive treatment effect $\tau$, and a more
powerful test yields a typically shorter confidence interval and more accurate
point estimate; see Hodges and Lehmann (1963) or Lehmann (1975),
Section 4. In parallel, in an observational study, a sensitivity analysis for a
confidence interval or a point estimate is obtained by inverting a test, so the
95% confidence interval for a given $\Gamma$ excludes $\tau_0$ if the sensitivity
analysis for the test rejects $H_0 : \tau = \tau_0$; see Rosenbaum (1993, 2002a),
Section 4.3. For a given $\Gamma \ge 1$, one obtains an interval of possible point
estimates and a set of possible confidence intervals. As $I \to \infty$ with
$\Gamma \ge 1$ fixed, both the interval of possible point estimates and the union
of possible confidence intervals converge to a real interval $[\tau_L, \tau_H]$
of the treatment effects $\tau$ that are compatible with a bias of $\Gamma$, and an
increase in design sensitivity will shorten that interval; see Rosenbaum
[(2005), Proposition 1] for one such
result. As in experiments, even if one is interested in a confidence interval 
or point estimate, not a hypothesis test, one should obtain that interval or 
estimate by inverting a more powerful test. 

3. Why is adaptation important? Traditionally, adaptive methods have 
selected the best of several statistical procedures using the data at hand and 
they have focused on improving efficiency in randomized experiments in the 
absence of bias; see, for instance, Hogg (1974), Policello and Hettmansperger 
(1976) and Jones (1979). As discussed in Rosenbaum (2010a, 2011), Pitman 
efficiency and design sensitivity both affect the power of a sensitivity analy- 
sis in an observational study, but they can work at cross-purposes. Pitman 
efficiency aims at power to detect small effects in randomized experiments 
where bias is not an issue. In an observational study, Pitman efficiency
predicts the outcome of a sensitivity analysis for $\Gamma = 1$ in the favorable situation;
that is, it predicts the outcome of a randomization test applied in an ob- 
servational study when bias is eliminated by adjustments such as matching. 
Small effects, however, are invariably sensitive to small unobserved biases, 
which are absent in an idealized randomized experiment but can never be 
excluded from consideration in an observational study. Procedures with
superior design sensitivity in observational studies look for stable evidence of
moderately large effects, in effect ignoring pairs $i$ with small $|Y_i|$.

There are procedures with good Pitman efficiency and better design sen- 
sitivity than Wilcoxon's statistic, and other statistics with poor Pitman 
efficiency and vastly better design sensitivity than Wilcoxon's statistic. For 
instance, in testing $H_0$, Brown (1981) proposed a statistic which ignores
the 1/3 of pairs with the smallest $|Y_i|$ or $q_i$, gives weight 1 to the signs of
the 1/3 of pairs with the middle values of $|Y_i|$ or $q_i$, and gives weight 2 to
the remaining 1/3 of pairs with the largest values of $|Y_i|$ or $q_i$. Brown (1981)
shows his statistic is highly robust and almost as efficient as Wilcoxon's

Table 1
Pitman asymptotic relative efficiency versus the Wilcoxon statistic
for a shift alternative in a paired randomized experiment with
errors from a Normal distribution, a logistic distribution or a
t-distribution with 3 degrees of freedom. In the version used here,
Noether's statistic is the number of positive differences among the
1/3 of pairs with the largest absolute differences

              Normal    Logistic    t, 3 df
Sign           0.67       0.75       0.85
Noether        0.78       0.69       0.59
Brown          0.95       0.94       0.93
Wilcoxon       1.00       1.00       1.00


statistic in a randomized experiment, whereas in Rosenbaum (2010a) it is
seen that Brown's statistic has higher design sensitivity in a range of
sampling situations; in combination, these two facts produce improved power in
a sensitivity analysis. Noether (1973) proposed a simpler class of statistics
that simply counts the number of positive $Y_i$ among pairs with large $|Y_i|$
or $q_i$. Markowski and Hettmansperger (1982) studied the Pitman efficiency
of many statistics similar to those of Brown and Noether by varying the
number of pairs that are given various weights; see also the group rank
statistics of Gastwirth (1966) and Groeneveld (1972). In the version used in
the current paper, but not in Noether's paper, Noether's statistic counts
the number of positive $Y_i$ among the 1/3 of pairs with largest $|Y_i|$ or $q_i$.
Having mentioned once that Noether did not promote this specific version of
his statistic, I will not mention this again, and will refer to the statistic
as Noether's statistic. Brown's statistic has Pitman efficiency of 0.95
relative to the Wilcoxon statistic for an additive effect with Normal errors, but
Noether's statistic has Pitman efficiency of only 0.78, so one would not use
this version of Noether's statistic for Normal data from a randomized
experiment. Table 1 gives Pitman efficiencies in a paired randomized experiment.
In contrast, in a sensitivity analysis in an observational study, if
$Y_i = \tau + \varepsilon_i$ with $\varepsilon_i \sim_{\mathrm{i.i.d.}} N(0, \sigma^2)$
and $\tau/\sigma = 1/2$, the design sensitivity for Wilcoxon's
statistic is $\tilde{\Gamma} = 3.17$, for Brown's statistic is $\tilde{\Gamma} = 3.60$
and for Noether's statistic is $\tilde{\Gamma} = 4.97$, so for sufficiently large
$I$ Noether's statistic will be the best performer in a sensitivity analysis.
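
The design sensitivities quoted for Brown's and Noether's statistics can be
approximated by a short calculation. The Python sketch below rests on a
heuristic that is an assumption of this example, not a derivation from the
paper: the statistic divided by its number of pairs converges to a limit under
the favorable situation, the worst-case null bound converges to a limit
depending on $\Gamma/(1 + \Gamma)$, and the design sensitivity is the $\Gamma$
at which the two limits agree. Errors are taken as Normal with $\sigma = 1$,
thirds are split at the 1/3 and 2/3 quantiles of $|Y_i|$, and all names are
hypothetical.

```python
from math import erf, sqrt


def normal_cdf(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def abs_quantile(prob, tau):
    """Find c with Pr(|Y| <= c) = prob for Y ~ N(tau, 1), by bisection."""
    lo, hi = 0.0, abs(tau) + 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if normal_cdf(mid - tau) - normal_cdf(-mid - tau) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def design_sensitivities(tau):
    """Return (Brown, Noether) design sensitivities under the stated heuristic."""
    c_top = abs_quantile(2.0 / 3.0, tau)   # cutoff for the largest third of |Y|
    c_mid = abs_quantile(1.0 / 3.0, tau)   # cutoff for the middle third of |Y|
    pos_top = 1.0 - normal_cdf(c_top - tau)                   # Pr(Y > c_top)
    neg_top = normal_cdf(-c_top - tau)                        # Pr(Y < -c_top)
    pos_mid = normal_cdf(c_top - tau) - normal_cdf(c_mid - tau)
    # Noether: share of positives in the top third equals Gamma / (1 + Gamma).
    noether = pos_top / neg_top
    # Brown: limit of (2 * B1 + B2) / I equals Gamma / (1 + Gamma).
    limit = 2.0 * pos_top + pos_mid
    brown = limit / (1.0 - limit)
    return brown, noether


brown_ds, noether_ds = design_sensitivities(0.5)
```

For $\tau/\sigma = 1/2$ the two values come out close to the 3.60 and 4.97
reported in the text.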

The adaptive test uses both Brown's statistic and Noether's statistic. Ad- 
justing the critical values to control the level of the test, the adaptive test 
rejects if either Brown's statistic or Noether's statistic supports rejection. 
In every sampling situation, the adaptive test has the larger of the two de- 
sign sensitivities for Brown's and Noether's statistics. The important issue, 
however, is the power of a sensitivity analysis for finite $I$. Because
asymptotic claims for some adaptive procedures are not readily seen in samples
of plausible size, the current paper uses the exact null distribution of the 
adaptive test and emphasizes finite sample power determined by simulation. 
The pairing of Brown's statistic and Noether's statistic is a pairing of two 
strong candidates for which the required exact calculations are feasible. 

4. An adaptive test. 

4.1. The exact null distribution in a sensitivity analysis. Let
$0 < \lambda_1 < \lambda_2 < 1$. Let $I_1$ be the number of pairs with absolute
ranks $q_i \ge (1 - \lambda_1) I$ and let $B_1$ be the number of positive $Y_i$
among these $I_1$ pairs. Also, let $I_2$ be the number of ranks with
$(1 - \lambda_1) I > q_i \ge (1 - \lambda_2) I$ and let $B_2$ be the
number of positive $Y_i$ among these $I_2$ pairs. Noether (1973) proposed $B_1$
as a test statistic, and Brown (1981) and Markowski and Hettmansperger
(1982) proposed $T = 2B_1 + B_2$ as a test statistic; see also Gastwirth (1966).
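
To make the definitions concrete, the following Python sketch computes
$I_1$, $B_1$, $I_2$, $B_2$ and $T = 2B_1 + B_2$ from a vector of untied,
nonzero pair differences. It is illustrative only: the function name and the
sample data are hypothetical, the rank cutoffs are read as "greater than or
equal to", and exact fractions are used so that the cutoffs are not disturbed
by floating-point rounding.

```python
from fractions import Fraction


def brown_noether_counts(diffs, lam1=Fraction(1, 3), lam2=Fraction(2, 3)):
    """Return (I1, B1, I2, B2, T) for untied, nonzero pair differences."""
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank            # rank of |Y_i|, from 1 to n
    upper_cut = (1 - lam1) * n         # ranks at or above this: weight 2
    lower_cut = (1 - lam2) * n         # ranks from here up to upper_cut: weight 1
    i1 = b1 = i2 = b2 = 0
    for y, q in zip(diffs, ranks):
        if q >= upper_cut:
            i1 += 1
            b1 += int(y > 0)
        elif q >= lower_cut:
            i2 += 1
            b2 += int(y > 0)
    return i1, b1, i2, b2, 2 * b1 + b2


sample = [0.3, -0.1, 1.2, -0.8, 2.5, 0.6, -1.9, 3.1, 0.05, 1.5]
counts = brown_noether_counts(sample)
```

In this made-up sample of ten differences, the four largest absolute values
contain three positives and the next three contain two, so `counts` is
`(4, 3, 3, 2, 8)`.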

Let $\overline{B}_1$ and $\overline{B}_2$ be independent binomials with sample
sizes $I_1$ and $I_2$ and probabilities of success $\kappa = \Gamma/(1 + \Gamma)$,
and let $\underline{B}_1$ and $\underline{B}_2$ be independent binomials with
sample sizes $I_1$ and $I_2$ and probabilities of success
$\underline{\kappa} = 1/(1 + \Gamma)$. Also, let
$\overline{T} = 2\overline{B}_1 + \overline{B}_2$ and
$\underline{T} = 2\underline{B}_1 + \underline{B}_2$. A function $g(\cdot, \cdot)$
is monotone increasing if $g(b_1, b_2) \le g(b_1', b_2')$ whenever
$b_1 \le b_1'$ and $b_2 \le b_2'$.
Under the sensitivity model (2), if $H_0$ is true, then it is not difficult to
show [Rosenbaum (2002a), Section 4] that for every monotone increasing
function $g(\cdot, \cdot)$,

(3)   $\Pr\{g(\underline{B}_1, \underline{B}_2) \ge k\} \le \Pr\{g(B_1, B_2) \ge k \mid \mathcal{F}, \mathcal{Z}\} \le \Pr\{g(\overline{B}_1, \overline{B}_2) \ge k\}$   for every $k$,

and the bounds in (3) are sharp in the sense of being attained for some
$\Pr(Z_{i1} = 1 \mid \mathcal{F}, \mathcal{Z})$ that satisfy (2), so the bounds (3)
cannot be improved without additional information that further restricts
$\Pr(Z_{i1} = 1 \mid \mathcal{F}, \mathcal{Z})$. If
$\Gamma = 1$, then there is equality throughout (3) and then (3) is the
randomization distribution of $g(B_1, B_2)$ under $H_0$.

Let $g(B_1, B_2) = 1$ if $B_1 \ge k_{B,\Gamma}$ or $2B_1 + B_2 \ge k_{T,\Gamma}$
and $g(B_1, B_2) = 0$ otherwise, for suitable constants $k_{B,\Gamma}$ and
$k_{T,\Gamma}$; then $g(\cdot,\cdot)$ is monotone increasing. For a given
$\Gamma \ge 1$, the adaptive test rejects $H_0$ at level $\alpha$ for all
$\pi_{ij}$ satisfying (1) if $B_1 \ge k_{B,\Gamma}$ or
$2B_1 + B_2 \ge k_{T,\Gamma}$. The constants $k_{B,\Gamma}$ and $k_{T,\Gamma}$
are determined to satisfy the following conditions:

(4) $\Pr(\overline{B}_1 \ge k_{B,\Gamma} \text{ or } \overline{T} \ge k_{T,\Gamma}) \le \alpha$,

(5) $\Pr(\overline{B}_1 \ge k_{B,\Gamma} - 1 \text{ or } \overline{T} \ge k_{T,\Gamma}) > \alpha$ and
    $\Pr(\overline{B}_1 \ge k_{B,\Gamma} \text{ or } \overline{T} \ge k_{T,\Gamma} - 1) > \alpha$

and

(6) $|\Pr(\overline{B}_1 \ge k_{B,\Gamma}) - \Pr(\overline{T} \ge k_{T,\Gamma})|$ is minimized subject to (4) and (5).

The joint distribution of $(\overline{B}_1, \overline{B}_2)$ is that of two
independent binomials and in R is given by
outer(dbinom(0:I1, I1, kappa), dbinom(0:I2, I2, kappa), "*"); then
finding $k_{B,\Gamma}$ and $k_{T,\Gamma}$ to satisfy (4)-(6) is simply arithmetic.

Although I have never seen this, in principle, there could be two values,
$(k_{B,\Gamma}, k_{T,\Gamma})$ and
$(k'_{B,\Gamma}, k'_{T,\Gamma}) = (k_{B,\Gamma} - 1, k_{T,\Gamma} + 1)$, that
both satisfy (4)-(6). To avoid this ambiguity in the definition of the adaptive
procedure, simply use $(k_{B,\Gamma}, k_{T,\Gamma})$ in this extremely unlikely
case, thereby preferring to reduce the critical value for Brown's statistic $T$.
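The arithmetic behind conditions (4)-(6) can be sketched in a few lines. The following Python code is an illustrative reconstruction, not the author's program: the function names are hypothetical, the search is a plain enumeration over candidate constants, and allowing a constant one step beyond the largest attainable value (so that one component never rejects) is an assumption of this sketch. The usage lines take the inputs of the numerical example in Section 4.2.

```python
from fractions import Fraction
from math import ceil, comb


def binom_pmf(n, p):
    """Pr(X = 0), ..., Pr(X = n) for X ~ Binomial(n, p)."""
    return [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]


def upper_tail(pmf):
    """tail[k] = Pr(X >= k) for k = 0, ..., n + 1, so tail[n + 1] = 0."""
    tail = [0.0] * (len(pmf) + 1)
    for k in range(len(pmf) - 1, -1, -1):
        tail[k] = tail[k + 1] + pmf[k]
    return tail


def group_sizes(n_pairs, lam1, lam2):
    """(I1, I2) for untied ranks 1..I; pass lam1 and lam2 as Fractions."""
    cut1 = ceil((1 - lam1) * n_pairs)   # smallest rank counted in B1
    cut2 = ceil((1 - lam2) * n_pairs)   # smallest rank counted in B2
    return n_pairs - cut1 + 1, cut1 - cut2


def level(pmf1, tail1, tail2, k_b, k_t):
    """Pr(B1 >= k_b or 2*B1 + B2 >= k_t) for independent binomials B1, B2."""
    n2 = len(tail2) - 2
    prob = tail1[k_b]
    for b1 in range(k_b):                          # cases with B1 < k_b
        need = min(max(k_t - 2 * b1, 0), n2 + 1)   # B2 needed for T >= k_t
        prob += pmf1[b1] * tail2[need]
    return prob


def adaptive_critical_values(n1, n2, gamma, alpha=0.05):
    """Enumerate (k_B, k_T) and keep the pair satisfying (4)-(6)."""
    kappa = gamma / (1 + gamma)
    pmf1 = binom_pmf(n1, kappa)
    tail1, tail2 = upper_tail(pmf1), upper_tail(binom_pmf(n2, kappa))
    best = None
    for k_b in range(1, n1 + 2):            # k_b = n1 + 1: B1 never rejects
        if tail1[k_b] > alpha:
            continue
        for k_t in range(1, 2 * n1 + n2 + 2):
            if level(pmf1, tail1, tail2, k_b, k_t) > alpha:
                continue                     # violates (4)
            if (level(pmf1, tail1, tail2, k_b - 1, k_t) <= alpha
                    or level(pmf1, tail1, tail2, k_b, k_t - 1) <= alpha):
                continue                     # violates (5)
            gap = abs(tail1[k_b] - level(pmf1, tail1, tail2, n1 + 1, k_t))
            key = (gap, -k_b)                # (6); ties go to the larger k_B
            if best is None or key < best[0]:
                best = (key, k_b, k_t)
    if best is None:
        raise ValueError("no critical values satisfy (4) and (5)")
    return best[1], best[2]


n1, n2 = group_sizes(250, Fraction(1, 3), Fraction(2, 3))
k_b, k_t = adaptive_critical_values(n1, n2, gamma=4.0, alpha=0.05)
print(n1, n2, k_b, k_t)
```

Passing the fractions as `Fraction` objects keeps the rank cutoffs exact; with floating-point values, a product such as (1 - 1/3) * 300 can land slightly above 200 and shift a group size by one.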

4.2. Numerical example of the null distribution. To illustrate the
computations in (4)-(6), take $I = 250$ untied pairs, $\Gamma = 4$,
$\alpha = 0.05$, and $\lambda_1 = 1/3$, $\lambda_2 = 2/3$; then $I_1 = 84$,
$I_2 = 83$, $\kappa = 4/5$. This yields $k_{B,\Gamma} = 74$ and
$k_{T,\Gamma} = 216$ with

(7) $\Pr(\overline{B}_1 \ge 74 \text{ or } \overline{T} \ge 216) = 0.0488 \le \alpha = 0.05$,

(8) $\Pr(\overline{B}_1 \ge 74) = 0.0370$, $\Pr(\overline{T} \ge 216) = 0.0320$,

(9) $|\Pr(\overline{B}_1 \ge 74) - \Pr(\overline{T} \ge 216)| = 0.0050$.

In light of this, for $\Gamma = 4$, the upper bound on the one-sided P-value
testing no effect would be less than $\alpha = 0.05$ if either $B_1 \ge 74$ or
$T = 2B_1 + B_2 \ge 216$. Several aspects of the illustration (7)-(9) deserve
comment. First, if one were to test using $B_1$ alone, ignoring $B_2$, then at
$\Gamma = 4$ the upper bound on the one-sided P-value testing no effect would
be less than $\alpha = 0.05$ if $B_1 \ge 74$, because
$\Pr(\overline{B}_1 \ge 74) = 0.0370$ and $\Pr(\overline{B}_1 \ge 73) = 0.0691$;
that is, in this particular case, owing to the discreteness of the binomial
distribution, the adaptive test will reject in every instance in which the test
based on $B_1$ alone rejects and the adaptive test will reject in some other
cases as well. Conversely, if one were to test using $T$ alone, then at
$\Gamma = 4$ the upper bound on the one-sided P-value testing no effect would
be less than $\alpha = 0.05$ if $T \ge 215$ rather than $k_{T,\Gamma} = 216$ in
(7) because $\Pr(\overline{T} \ge 215) = 0.04288$ and
$\Pr(\overline{T} \ge 214) = 0.05642$. So, in this one numerical example, the
adaptive test rejects in every instance in which the test based on $B_1$
rejects and also in every instance in which $T$ rejects except $B_1 < 74$ and
$T = 215$. Use of the Bonferroni inequality to approximate
$\Pr(\overline{B}_1 \ge 74 \text{ or } \overline{T} \ge 216)$ would err
substantially, with
$\Pr(\overline{B}_1 \ge 74 \text{ or } \overline{T} \ge 216) = 0.0488 < \Pr(\overline{B}_1 \ge 74) + \Pr(\overline{T} \ge 216) = 0.0370 + 0.0320 = 0.0690$.
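The comparison between the exact joint probability and the Bonferroni bound can be checked by direct enumeration over the two independent binomials. The sketch below is illustrative only; `joint_probability` is a hypothetical helper, and the constants 84, 83, 4/5, 74 and 216 are taken from the example above.

```python
from math import comb


def binom_pmf(n, p):
    """Pr(X = 0), ..., Pr(X = n) for X ~ Binomial(n, p)."""
    return [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]


def joint_probability(n1, n2, p, event):
    """Pr{event(B1, B2)} for independent B1 ~ Bin(n1, p) and B2 ~ Bin(n2, p)."""
    pmf1, pmf2 = binom_pmf(n1, p), binom_pmf(n2, p)
    return sum(
        p1 * p2
        for b1, p1 in enumerate(pmf1)
        for b2, p2 in enumerate(pmf2)
        if event(b1, b2)
    )


I1, I2, KAPPA = 84, 83, 4 / 5   # inputs of the example with Gamma = 4

p_noether = joint_probability(I1, I2, KAPPA, lambda b1, b2: b1 >= 74)
p_brown = joint_probability(I1, I2, KAPPA, lambda b1, b2: 2 * b1 + b2 >= 216)
p_union = joint_probability(
    I1, I2, KAPPA, lambda b1, b2: b1 >= 74 or 2 * b1 + b2 >= 216
)
bonferroni = p_noether + p_brown   # upper bound that ignores the overlap

print(f"Pr(B1 >= 74)              = {p_noether:.4f}")
print(f"Pr(T >= 216)              = {p_brown:.4f}")
print(f"Pr(B1 >= 74 or T >= 216)  = {p_union:.4f}")
print(f"Bonferroni bound          = {bonferroni:.4f}")
```

The gap between the last two lines is the probability that both components reject at once, which is substantial here because $\overline{T}$ contains $2\overline{B}_1$.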

4.3. Design sensitivity of the adaptive test. As discussed in Section 2.3,
in an observational study, the favorable situation means there is a treatment
effect and no bias from an unmeasured covariate. In an observational study,
we cannot know from the data whether we are in the favorable situation, so
the best we can hope to say is that the study's conclusions are insensitive
to small and moderate biases. The power of an $\alpha$-level sensitivity
analysis, $0 < \alpha < 1$, performed with a specific $\Gamma \ge 1$, is the
probability that the entire interval of possible P-values from the sensitivity
analysis is less than or equal to $\alpha$. For the adaptive test, the power of
the sensitivity analysis for fixed $\Gamma$ is the probability that
$B_1 \ge k_{B,\Gamma}$ or $T \ge k_{T,\Gamma}$ when $B_1$ and $T$ are computed
from data that are, in fact, measuring a treatment effect without bias. In
principle, one could compute the power conditional upon $\mathcal{F}$, but this
would mean that the power would be a function of $\mathcal{F}$, so, in
practice, one computes the unconditional power averaging over a simple model
for the generation of $\mathcal{F}$. As noted in Section 2.3, the design
sensitivity is a number $\widetilde{\Gamma}$ such that the power of an
$\alpha$-level sensitivity analysis tends to 1 as $I \to \infty$ if the
sensitivity analysis is performed with $\Gamma < \widetilde{\Gamma}$ and the
power tends to zero if the analysis is performed with
$\Gamma > \widetilde{\Gamma}$.

In the current paper, the favorable situation refers to treated-minus-control
differences $Y_i$ that are drawn independently from a continuous cumulative
distribution $F(\cdot)$ that is strictly increasing, $F(y) < F(y')$ if
$y < y'$. One of many such favorable situations is
$Y_i = \tau + \varepsilon_i$ where the $\varepsilon_i$ are independent and
identically distributed observations from a continuous, strictly increasing
distribution with a density symmetric about zero.

For $y > 0$, let $H(y) = F(y) - F(-y)$; then $H(y) = \Pr(|Y_i| < y)$, and for
$\lambda \in [0, 1)$, the inverse function is well defined with
$H^{-1}(\lambda) = y$ if $\lambda = \Pr(|Y_i| < y)$. Also define
$\zeta(\lambda)$ to be the probability that a $Y_i$ is both positive,
$Y_i > 0$, and in the largest fraction $\lambda$ of the $|Y_i|$, that is, define

$\zeta(\lambda) = 1 - F\{H^{-1}(1 - \lambda)\} = \Pr[(Y_i > 0) \wedge \{|Y_i| > H^{-1}(1 - \lambda)\}]$.

Let $\widetilde{\Gamma}_{\mathrm{no}}$, $\widetilde{\Gamma}_{\mathrm{bmh}}$ and
$\widetilde{\Gamma}_{\mathrm{ad}}$ be the design sensitivities for,
respectively, Noether's statistic $B_1$, the Brown-Markowski-Hettmansperger
statistic $T$ and the adaptive procedure with critical values (4)-(6). That is,
$B_1$ counts the positive $Y_i$'s among the largest fraction $\lambda_1$ of the
$|Y_i|$, and $T = 2B_1 + B_2$ doubles $B_1$ and adds the count of the positive
$Y_i$'s among the next fraction $\lambda_2 - \lambda_1$ of the $|Y_i|$.

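Proposition 1. If $Y_i$, $i = 1, \ldots, I$, are independent observations from
$F(\cdot)$, then

(10) $\widetilde{\Gamma}_{\mathrm{no}} = \dfrac{\zeta(\lambda_1)}{\lambda_1 - \zeta(\lambda_1)}$,

(11) $\widetilde{\Gamma}_{\mathrm{bmh}} = \dfrac{\zeta(\lambda_1) + \zeta(\lambda_2)}{\{\lambda_1 - \zeta(\lambda_1)\} + \{\lambda_2 - \zeta(\lambda_2)\}}$,

(12) $\widetilde{\Gamma}_{\mathrm{ad}} = \max(\widetilde{\Gamma}_{\mathrm{no}}, \widetilde{\Gamma}_{\mathrm{bmh}})$.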

Proof. The proof uses Proposition 2 of Rosenbaum (2010a) which concerns the
design sensitivity of a signed rank statistic with general scores; in
particular, $B_1$ and $T$ are two such signed rank statistics. Equations (10)
and (11) are obtained by simplifying expression (8) in Proposition 2 of
Rosenbaum (2010a), which is a formula for the design sensitivity with general
scores. As shown in the proof of that proposition, in a sensitivity analysis
performed at a specific value of $\Gamma$, the upper bound on the P-value for
$B_1$ converges in probability to zero as $I \to \infty$ if
$\Gamma < \widetilde{\Gamma}_{\mathrm{no}}$ and it converges to 1 if
$\Gamma > \widetilde{\Gamma}_{\mathrm{no}}$, and, in parallel, the upper bound
on the P-value for $T$ converges in probability to zero as $I \to \infty$ if
$\Gamma < \widetilde{\Gamma}_{\mathrm{bmh}}$ and it converges to 1 if
$\Gamma > \widetilde{\Gamma}_{\mathrm{bmh}}$. As a consequence, the smaller of
these two P-values for $B_1$ and $T$ tends to zero as $I \to \infty$ if
$\Gamma < \max(\widetilde{\Gamma}_{\mathrm{no}}, \widetilde{\Gamma}_{\mathrm{bmh}})$
and it tends to 1 if
$\Gamma > \max(\widetilde{\Gamma}_{\mathrm{no}}, \widetilde{\Gamma}_{\mathrm{bmh}})$,
proving (12). □

Table 2
Design sensitivity $\widetilde{\Gamma}$ in the favorable situation with an
additive treatment effect, $\tau$, and errors $\varepsilon_i$ with variance
$\sigma^2$ that are Normal, logistic or t-distributed with 3 degrees of freedom

Statistic      Normal    Logistic    t 3 df

$\tau/\sigma = 1/4$
Sign             1.49        1.57      1.88
Wilcoxon         1.76        1.83      2.21
Brown            1.86        1.93      2.34
Noether          2.12        2.14      2.48
Adaptive         2.12        2.14      2.48

$\tau/\sigma = 1/2$
Sign             2.24        2.48      3.44
Wilcoxon         3.17        3.40      4.74
Brown            3.60        3.83      5.39
Noether          4.97        4.72      5.77
Adaptive         4.97        4.72      5.77

$\tau/\sigma = 3/4$
Sign             3.41        3.90      6.02
Wilcoxon         5.92        6.42      9.70
Brown            7.55        7.91     11.69
Noether         13.48       10.86     12.08
Adaptive        13.48       10.86     12.08



Table 2 calculates the design sensitivity of the sign statistic, the Wilcoxon
signed rank statistic, Noether's statistic with $\lambda_1 = 1/3$, Brown's
statistic with $\lambda_1 = 1/3$, $\lambda_2 = 2/3$, and the adaptive procedure
with critical values (4)-(6). In Table 2, $Y_i = \tau + \varepsilon_i$ where
$\mathrm{var}(\varepsilon_i) = \sigma^2$, the effect size is specified in units
of the standard deviation, $\tau/\sigma$, and $\varepsilon_i$ has a standard
Normal distribution, a logistic distribution or a central t-distribution with 3
degrees of freedom. For example, if one takes $\sigma = 1$, then for the Normal
$\tau/\sigma = 1/2$ if $\tau = 1/2$, for the logistic $\tau/\sigma = 1/2$ if
$\tau = (1/2)(\pi/\sqrt{3}) = 0.907$, and for the t-distribution with 3 degrees
of freedom, $\tau/\sigma = 1/2$ if $\tau = (1/2)\sqrt{3} = 0.866$. Although
$\widetilde{\Gamma}_{\mathrm{no}} > \widetilde{\Gamma}_{\mathrm{bmh}}$
throughout Table 2, there are many situations with
$\widetilde{\Gamma}_{\mathrm{no}} < \widetilde{\Gamma}_{\mathrm{bmh}}$; for
instance, with $\lambda_1 = 1/3$ this can occur in a t-distribution with 2 or 1
degrees of freedom, where the t with 1 degree of freedom is the Cauchy
distribution, and it occurs in the t-distribution with 3 degrees of freedom in
Table 6 of Section 6.1 with $\lambda_1 < 1/3$.
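Formulas (10)-(12) are easy to evaluate numerically once $F(\cdot)$ is specified. The sketch below does this for Normal errors with $\tau/\sigma = 1/2$, inverting $H$ by bisection; the function name, the bisection bracket and the tolerance are choices made for this illustration and are not taken from the paper.

```python
from statistics import NormalDist


def design_sensitivities(cdf, lam1, lam2, upper=50.0, tol=1e-12):
    """Evaluate (10), (11) and (12) for differences Y with distribution cdf."""

    def H(y):
        # H(y) = F(y) - F(-y) = Pr(|Y| < y)
        return cdf(y) - cdf(-y)

    def H_inv(prob):
        # Bisection on [0, upper]; H is continuous and strictly increasing.
        lo, hi = 0.0, upper
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if H(mid) < prob:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    def zeta(lam):
        # zeta(lambda) = 1 - F{H^{-1}(1 - lambda)}
        return 1.0 - cdf(H_inv(1.0 - lam))

    z1, z2 = zeta(lam1), zeta(lam2)
    gamma_no = z1 / (lam1 - z1)                               # (10)
    gamma_bmh = (z1 + z2) / ((lam1 - z1) + (lam2 - z2))       # (11)
    return gamma_no, gamma_bmh, max(gamma_no, gamma_bmh)      # (12)


# Normal errors with sigma = 1 and an additive effect tau = 1/2.
F = NormalDist(mu=0.5, sigma=1.0).cdf
gamma_no, gamma_bmh, gamma_ad = design_sensitivities(F, 1 / 3, 2 / 3)
print(round(gamma_no, 2), round(gamma_bmh, 2), round(gamma_ad, 2))
```

Under these assumptions the values should land close to the Normal, $\tau/\sigma = 1/2$ entries of Table 2 for Noether's, Brown's and the adaptive procedure. Any other continuous, strictly increasing distribution function can be passed as `cdf`.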

As an illustration of the properties of design sensitivity, consider the case
of $\tau/\sigma = 1/2$ in Table 2 for the t-distribution with 3 degrees of
freedom. The design sensitivity for Wilcoxon's statistic in this case is
$\widetilde{\Gamma} = 4.74$, whereas for Noether's statistic it is
$\widetilde{\Gamma}_{\mathrm{no}} = 5.77$. In sufficiently large samples from
this distribution, Wilcoxon's statistic should be sensitive to a bias of
magnitude $\Gamma = 5$ but Noether's statistic should not. Drawing a single
sample of $I = 10{,}000$ pairs from this distribution and performing a
sensitivity analysis with $\Gamma = 5$ yields an upper bound on the P-value for
Wilcoxon's statistic of 0.9985 and for Noether's statistic of 0.0071, so a
deviation from random assignment of magnitude $\Gamma = 5$ could readily
explain the observed value of Wilcoxon's statistic, but not the observed value
of Noether's statistic. At the $\alpha = 0.011$ level with $\Gamma = 5$, the
adaptive test rejects $H_0$ because Noether's statistic has passed its critical
point in (4) although Brown's statistic has not. Because $I$ was very large in
this illustration, test performance was predicted by the design sensitivity,
but in smaller sample sizes, both design sensitivity and efficiency affect test
performance.

4.4. Simulation: Power of a sensitivity analysis in the favorable situation.
The power of a sensitivity analysis is examined for finite $I$ by simulation in
Tables 3 and 4. The tables describe the favorable situation: there is a
treatment effect and no bias from unobserved covariates, but of course the
investigator does not know this in an observational study, and so performs a
sensitivity analysis. The power of a 0.05-level sensitivity analysis is the
probability that the upper bound on the one-sided P-value is less than 0.05.
The power is determined for Wilcoxon's signed rank test, Brown's test,
Noether's test and the adaptive test that uses both Brown's and Noether's
tests.

In Tables 3 and 4, there is an additive effect and no bias from unobserved
covariates, that is, $Y_i = \tau + \varepsilon_i$ and the $\varepsilon_i$ are
independent and identically distributed with a Normal, a logistic or a central
t-distribution with 3 degrees of freedom. In Table 3, the effect is half the
standard deviation $\sigma$ of the $\varepsilon_i$'s, $\tau/\sigma = 1/2$,
whereas in Table 4, the effect is either $\tau/\sigma = 1/4$ or
$\tau/\sigma = 3/4$.

Each sampling situation is replicated 10,000 times. Therefore, the standard
error of the simulated power is at most $\sqrt{1/(4 \times 10{,}000)} = 0.005$.
In each sampling situation for each $\Gamma$, the two highest powers are in
bold.
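A reduced version of this simulation can be sketched for Noether's test alone with Normal errors. Everything below is illustrative: the function names, the seed and the smaller default number of replications are assumptions of this sketch, and the paper's own simulation also covers Wilcoxon's, Brown's and the adaptive test.

```python
import random
from fractions import Fraction
from math import ceil, comb, sqrt


def noether_critical_value(n1, gamma, alpha=0.05):
    """Smallest k with Pr(Bin(n1, gamma / (1 + gamma)) >= k) <= alpha."""
    kappa = gamma / (1 + gamma)
    tail = 0.0
    for k in range(n1, -1, -1):
        tail += comb(n1, k) * kappa**k * (1 - kappa) ** (n1 - k)
        if tail > alpha:
            return k + 1        # may equal n1 + 1: the test can never reject
    return 0


def noether_statistic(y, n1):
    """B1: number of positive differences among the n1 largest |y| (no ties)."""
    return sum(v > 0 for v in sorted(y, key=abs)[-n1:])


def simulated_power(n_pairs, gamma, tau, reps=2000, lam1=Fraction(1, 3), seed=0):
    """Monte Carlo power of Noether's test in the favorable situation."""
    rng = random.Random(seed)
    n1 = n_pairs - ceil((1 - lam1) * n_pairs) + 1   # ranks q >= (1 - lam1) * I
    k = noether_critical_value(n1, gamma)
    rejections = 0
    for _ in range(reps):
        # Normal errors with sigma = 1 and no unmeasured bias.
        y = [tau + rng.gauss(0.0, 1.0) for _ in range(n_pairs)]
        rejections += noether_statistic(y, n1) >= k
    return rejections / reps


max_standard_error = sqrt(1 / (4 * 10_000))   # bound quoted in the text
power_gamma_1 = simulated_power(100, gamma=1.0, tau=0.5)
power_gamma_3 = simulated_power(100, gamma=3.0, tau=0.5)
print(max_standard_error, power_gamma_1, power_gamma_3)
```

Because both calls reuse the same seed, they see identical simulated data, so the estimated power cannot increase when $\Gamma$ increases. With only 2,000 replications the standard error bound is $\sqrt{1/(4 \times 2{,}000)} \approx 0.011$ rather than 0.005.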
Table 3
Simulated power with $I$ pairs of a 0.05 level sensitivity analysis performed
with sensitivity parameter $\Gamma$. In each situation, there is no bias and
there is an additive constant treatment effect $\tau$ whose magnitude is half
the standard deviation of the pair differences $Y_i$, so $\tau/\sigma = 1/2$.
Each situation is replicated 10,000 times. In each comparison, the two highest
powers are in bold

Pairs:             I = 100             I = 250             I = 500
$\Gamma$:        1     2     3       2     3     4       3     4     5

Normal errors, $\tau/\sigma = 1/2$
Wilcoxon      1.00  0.53  0.05    0.89  0.07  0.00    0.10  0.00  0.00
Brown         1.00  0.61  0.10    0.93  0.19  0.01    0.35  0.01  0.00
Noether       0.99  0.64  0.29    0.96  0.55  0.15    0.81  0.24  0.03
Adaptive      1.00  0.68  0.17    0.97  0.45  0.15    0.76  0.24  0.03

Logistic errors, $\tau/\sigma = 1/2$
Wilcoxon      1.00  0.65  0.10    0.95  0.16  0.00    0.25  0.00  0.00
Brown         1.00  0.70  0.15    0.97  0.28  0.02    0.53  0.03  0.00
Noether       0.99  0.61  0.26    0.95  0.48  0.11    0.75  0.17  0.02
Adaptive      1.00  0.70  0.18    0.97  0.41  0.11    0.70  0.17  0.02

t errors with 3 d.f., $\tau/\sigma = 1/2$
Wilcoxon      1.00  0.94  0.45    1.00  0.81  0.21    0.98  0.33  0.02
Brown         1.00  0.94  0.48    1.00  0.86  0.37    0.99  0.62  0.10
Noether       1.00  0.77  0.42    0.99  0.75  0.29    0.95  0.50  0.13
Adaptive      1.00  0.92  0.49    1.00  0.87  0.37    0.99  0.58  0.14

Based on Table 1, we expect Wilcoxon's statistic to have the highest power
for $\Gamma = 1$. Based on Table 2, we expect that for sufficiently large
$\Gamma$ and $I$, Noether's statistic will have the highest power. Combining
Tables 1 and 2, we see that, for the t-distribution with 3 degrees of freedom,
Brown's statistic is much more efficient than Noether's statistic but has only
slightly inferior design sensitivity, so Brown's statistic could have higher
power for quite large $I$. Proposition 1 suggests that the adaptive procedure
has fulfilled its potential if it has power close to the maximum of the powers
of Brown's and Noether's statistics. With a few exceptions, these expectations
are confirmed in Tables 3 and 4. Notably, in Tables 3 and 4, the adaptive
procedure is never very bad, whereas other statistics perform poorly in some
cases; for instance, in Table 4 the power loss is 90% for Wilcoxon's statistic
compared to Noether's statistic for $I = 500$ pairs, Normal errors,
$\tau/\sigma = 3/4$.

It is useful to contrast Tables 2, 3 and 4. For instance, the design
sensitivity (as $I \to \infty$) of Noether's statistic is
$\widetilde{\Gamma} = 4.97$ for matched pair differences $Y_i$ that are
$Y_i \sim_{\mathrm{i.i.d.}} N(1/2, 1)$ in Table 2, but the power is only 15% in
this case at $\Gamma = 4 < 4.97 = \widetilde{\Gamma}$ for $I = 250$ pairs in
Table 3. That is, if $Y_i \sim_{\mathrm{i.i.d.}} N(1/2, 1)$ with $I = 250$
pairs, there is an 85% chance the results will be sensitive at $\Gamma = 4$,
even though as $I \to \infty$ the same distribution would eventually be seen to
be insensitive at $\Gamma = 4$. The design sensitivity $\widetilde{\Gamma}$ in
Table 2 refers to the limit as $I \to \infty$, so results will typically become
sensitive at a smaller $\Gamma$, $\Gamma < \widetilde{\Gamma}$, in a finite
sample, $I < \infty$. Although Noether's statistic is always best in Table 2,
it is not always best in Tables 3 and 4; however, the adaptive test is never
far behind the best test in Tables 3 and 4.

Table 4
Simulated power with $I$ pairs of a 0.05 level sensitivity analysis performed
with sensitivity parameter $\Gamma$. In each situation, there is no bias and
there is an additive constant treatment effect $\tau$ whose magnitude is either
1/4 or 3/4 of the standard deviation $\sigma$ of the pair differences $Y_i$, so
$\tau/\sigma = 1/4$ or $\tau/\sigma = 3/4$. Each situation is replicated 10,000
times. In each comparison, the two highest powers are in bold

                 $\tau/\sigma = 1/4$              $\tau/\sigma = 3/4$
Pairs:         I = 100       I = 500          I = 100       I = 500
$\Gamma$:       1    1.5     1.5   1.75       2.5    3.5      5      6

Normal errors
Wilcoxon     0.78   0.16    0.44   0.05      0.92   0.49   0.28   0.02
Brown        0.75   0.18    0.53   0.11      0.94   0.67   0.78   0.33
Noether      0.60   0.20    0.65   0.33      0.97   0.78   0.99   0.92
Adaptive     0.72   0.22    0.67   0.28      0.96   0.80   0.99   0.87

Logistic errors
Wilcoxon     0.83   0.20    0.59   0.11      0.95   0.60   0.51   0.08
Brown        0.79   0.21    0.66   0.18      0.95   0.71   0.85   0.42
Noether      0.60   0.19    0.65   0.33      0.93   0.67   0.93   0.75
Adaptive     0.76   0.23    0.70   0.29      0.96   0.73   0.94   0.68

t errors, 3 d.f.
Wilcoxon     0.96   0.47    0.97   0.67      1.00   0.92   1.00   0.91
Brown        0.94   0.47    0.98   0.74      1.00   0.93   1.00   0.97
Noether      0.75   0.33    0.90   0.67      0.95   0.74   0.97   0.87
Adaptive     0.92   0.44    0.97   0.74      1.00   0.90   1.00   0.97



4.5. Ties. Ties are addressed in a straightforward manner when testing
$H_\tau$. Providing fewer than $(1 - \lambda_2) I$ of the $Y_i - \tau$ are
equal to zero, no adjustment for zero differences is needed in the discussion
in Section 4.1; that is, Brown's statistic and the adaptive procedure require
no adjustment unless more than 1/3 of the sample is tied at zero. For ties
among the $|Y_i - \tau|$, use average ranks in computing $q_i$; then $I_1$ and
$I_2$ are random variables that depend upon the pattern of ties, but the
procedure in Section 4.1 yields a test that is conditionally distribution-free
given the realized values of $I_1$ and $I_2$.
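The bookkeeping with tied absolute differences can be sketched as follows. This is an illustrative reading of the rule just stated, with hypothetical function names: average ranks are computed exactly as fractions, and the realized group sizes $I_1$ and $I_2$ are simply counted. Zero differences receive the smallest ranks, so they fall in neither group as long as there are fewer than $(1 - \lambda_2) I$ of them.

```python
from fractions import Fraction


def average_ranks(values):
    """Average ranks (1-based) of the values, returned as exact fractions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [Fraction(0)] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = Fraction((start + 1) + (end + 1), 2)   # mean of tied positions
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg
        start = end + 1
    return ranks


def brown_noether_counts(y, lam1, lam2, tau=0):
    """Return (I1, I2, B1, B2, T) for testing H_tau, allowing ties in |Y - tau|."""
    n = len(y)
    d = [v - tau for v in y]
    q = average_ranks([abs(v) for v in d])
    top = [v for v, r in zip(d, q) if r >= (1 - lam1) * n]
    mid = [v for v, r in zip(d, q) if (1 - lam2) * n <= r < (1 - lam1) * n]
    b1 = sum(v > 0 for v in top)    # positive differences in the top group
    b2 = sum(v > 0 for v in mid)    # positive differences in the middle group
    return len(top), len(mid), b1, b2, 2 * b1 + b2


# Small made-up data set with two pairs of tied absolute differences.
example = brown_noether_counts(
    [3, -1, 2, -2, 5, 1], Fraction(1, 3), Fraction(2, 3)
)
print(example)
```

The realized $I_1$ and $I_2$ returned here would then be used as the binomial sample sizes in the calculations of Section 4.1.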
5. Use of the adaptive procedure in the study of treatments for ovarian
cancer. The matched pair difference in weeks with toxicity in the first year
after diagnosis is highly significant in randomization tests; for instance, the
randomization based P-value from Wilcoxon's signed rank test is less than
10~^. For $\Gamma = 1.3$ and $\Gamma = 1.6$, the upper bounds on the one-sided
P-value from Wilcoxon's test are, respectively, 0.0032 and 0.128, so a bias of
$\Gamma = 1.3$ could not easily produce the observed value of Wilcoxon's
statistic, but a bias of $\Gamma = 1.6$ could do so. In contrast, the upper
bound on the one-sided P-value from the adaptive procedure is 0.004 for
$\Gamma = 1.6$, so the magnitude of bias that would explain the behavior of
Wilcoxon's statistic does not begin to explain the behavior of the adaptive
test. [The P-value at $\Gamma = 1.6$ for the adaptive test is the smallest
$\alpha$ in (4)-(6) that leads to rejection.] The upper bound on the P-value
from the adaptive test crosses 0.05 between $\Gamma = 1.96$ and
$\Gamma = 1.97$. At $\Gamma = 1.96$, the adaptive procedure rejects based on
Noether's statistic, which if used on its own would have an upper bound on its
one-sided P-value of 0.038. To put these quantities in context using the
approach in Rosenbaum and Silber (2009), $\Gamma = 2$ corresponds with an
unobserved covariate $u_{ij}$ that produces a three-fold increase in the odds
of greater toxicity and a five-fold increase in the odds of treatment by a
medical oncologist, so the adaptive test reports considerably less sensitivity
to bias from an unmeasured covariate than does Wilcoxon's test.
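The correspondence between $\Gamma = 2$ and a (three-fold, five-fold) pair of odds ratios can be checked with the amplification formula of Rosenbaum and Silber (2009), which the text cites but does not restate. The sketch below assumes that formula in the form $\Gamma = (\Lambda\Delta + 1)/(\Lambda + \Delta)$, with $\Lambda$ the odds ratio linking the unobserved covariate to treatment and $\Delta$ the odds ratio linking it to the outcome; the function names are hypothetical.

```python
def amplify(treatment_odds, outcome_odds):
    """Gamma implied by two odds ratios, assuming Gamma = (L*D + 1) / (L + D)."""
    return (treatment_odds * outcome_odds + 1) / (treatment_odds + outcome_odds)


def outcome_odds_for(gamma, treatment_odds):
    """Solve Gamma = (L*D + 1) / (L + D) for D; requires L > Gamma."""
    if treatment_odds <= gamma:
        raise ValueError("treatment_odds must exceed gamma")
    return (gamma * treatment_odds - 1) / (treatment_odds - gamma)


gamma_from_3_and_5 = amplify(treatment_odds=5, outcome_odds=3)
delta_for_gamma_2 = outcome_odds_for(gamma=2.0, treatment_odds=5.0)
print(gamma_from_3_and_5, delta_for_gamma_2)
```

The formula is symmetric in its two arguments, so many different pairs of odds ratios map to the same $\Gamma$; the pair (5, 3) is one of the combinations that yield $\Gamma = 2$.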

Similar results are found over the first five years. The upper bound on the
P-value from Wilcoxon's test is 0.080 for $\Gamma = 1.7$, whereas for the
adaptive test, the upper bound on the P-value is 0.047 for $\Gamma = 2.2$. As
before, it is Noether's test, not Brown's test, that leads the adaptive test to
reject.

To illustrate the calculations for toxicity in the first year, allowing for
ties, Noether's statistic looks at the largest $I_1 = 110$ of the $|Y_i|$,
finding $B_1 = 82$ of these have $Y_i > 0$, whereas Brown's statistic looks at
the largest $I_1 + I_2 = 110 + 106 = 216$ of the $|Y_i|$ and the statistic has
value $T = 2B_1 + B_2 = 222$. Using the binomial distribution as discussed in
Section 4.1 with $\Gamma = 1.96$, one finds
$\Pr(\overline{B}_1 \ge 82) = 0.0381$, $\Pr(\overline{T} \ge 237) = 0.0293$,
$\Pr(\overline{B}_1 \ge 82 \text{ or } \overline{T} \ge 237) = 0.0475$, so the
adaptive procedure rejects at the 0.05 level for every bias of magnitude at
most $\Gamma = 1.96$, but only Noether's test, not Brown's test, would have led
to rejection used on its own.

In Section 6.1 the choice of $(\lambda_1, \lambda_2)$ is discussed. As a
prelude to that discussion, consider the results of the sensitivity analysis
for toxicity in the first year for two choices of $(\lambda_1, \lambda_2)$
besides $(1/3, 2/3)$. If $\lambda_1 = 1/6$ and $\lambda_2 = 2/6$ are used in
place of $\lambda_1 = 1/3$ and $\lambda_2 = 2/3$, the adaptive procedure has an
upper bound on the one-sided P-value of 0.046 for $\Gamma = 3.3$. If
$\lambda_1 = 1/8$ and $\lambda_2 = 2/8$ are used, the adaptive procedure has an
upper bound on the one-sided P-value of 0.046 for $\Gamma = 3.7$. Using
$\lambda_1 = 1/8$ and allowing for ties in the $I = 344$ pairs, Noether's
statistic focuses on the 45 of 344 pairs with the largest $|Y_i|$ and finds
that 41 of these 45 pairs have $Y_i > 0$. In words, when there was a large
difference in weeks with toxicity, it was usually the result of greater
toxicity in a patient treated by a medical oncologist, and this seems unlikely
to have occurred by chance if the magnitude of bias from nonrandom assignment
is $\Gamma \le 3.7$.

6. Discussion. 

6.1. Variations on a theme: Other $\lambda$'s; other statistics. The adaptive
procedure in Section 4 uses two compatible test statistics from the statistical
literature. Brown's (1981) statistic was designed to be a serious competitor of
Wilcoxon's statistic in a randomized experiment without bias, yet Brown's
statistic has higher design sensitivity when errors are Normal or logistic or
t-distributed with 3 degrees of freedom. The version of Noether's (1973) test
used here has poor Pitman efficiency in these cases but much better design
sensitivity. So the adaptive procedure adapts between a procedure with good
Pitman efficiency and good design sensitivity and a procedure with poor Pitman
efficiency and excellent design sensitivity. There are, of course, many
possible variations on this theme, some more promising than others.

The statistics of Brown and Noether take one or two large steps, but
otherwise are constant as functions of the ranks $q_i$ of the $|Y_i|$. Are
large flat steps useful? Both statistics decrease the weight attached to small
$|Y_i|$ and increase the weight attached to large $|Y_i|$ without emphasizing
the extremely large $|Y_i|$. Would a gradual increase be better than a step?
Consider ranks that equal $q_i/I$ if $q_i/I \ge 1 - \lambda$ and equal 0 if
$q_i/I < 1 - \lambda$; call this the "$(1 - \lambda)$-step Wilcoxon statistic"
because it uses Wilcoxon's ranks above $1 - \lambda$. Wilcoxon's statistic is
the 0-step Wilcoxon statistic. The 2/3-step Wilcoxon statistic takes a step
where Noether's statistic takes a step, but it increases gradually thereafter,
and the 1/3-step Wilcoxon statistic takes a step where Brown's statistic takes
its first step, but it increases gradually thereafter. Table 5 contrasts the
design sensitivities of Brown's statistic, Noether's statistic and comparable
step-Wilcoxon statistics in the case of an additive treatment effect whose
magnitude is half the standard deviation of the errors. While the difference
between Brown's statistic and Noether's statistic is large, the difference
between either of these and its comparable step-Wilcoxon statistic is not
large.
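The score functions being compared can be written down directly as functions of the scaled rank $u = q_i/I$. The sketch below is only an illustration with hypothetical names; it assumes untied $|Y_i|$ and uses exact fractions so that comparisons at the step points are not affected by rounding.

```python
from fractions import Fraction

ONE_THIRD, TWO_THIRDS = Fraction(1, 3), Fraction(2, 3)


def noether_score(u, lam1=ONE_THIRD):
    """Score 1 for the largest fraction lam1 of the |Y|, otherwise 0."""
    return 1 if u >= 1 - lam1 else 0


def brown_score(u, lam1=ONE_THIRD, lam2=TWO_THIRDS):
    """Score 2 for the top group, 1 for the middle group, otherwise 0."""
    if u >= 1 - lam1:
        return 2
    if u >= 1 - lam2:
        return 1
    return 0


def step_wilcoxon_score(u, step):
    """Score u above the step point, 0 below it; step = 0 gives Wilcoxon."""
    return u if u >= step else 0


def signed_score_statistic(y, score):
    """Sum of score(q_i / I) over pairs with y_i > 0 (no ties assumed)."""
    n = len(y)
    order = sorted(range(n), key=lambda i: abs(y[i]))
    return sum(
        score(Fraction(rank, n))
        for rank, i in enumerate(order, start=1)
        if y[i] > 0
    )


y = [3, -1, 2, -4, 5, 0.5]   # made-up differences for illustration
b1 = signed_score_statistic(y, noether_score)
t = signed_score_statistic(y, brown_score)
step_23 = signed_score_statistic(y, lambda u: step_wilcoxon_score(u, TWO_THIRDS))
step_0 = signed_score_statistic(y, lambda u: step_wilcoxon_score(u, 0))
print(b1, t, step_23, step_0)
```

With these scores, `b1` and `t` coincide with $B_1$ and $T = 2B_1 + B_2$, while the step-Wilcoxon versions replace the flat top step with scores that keep rising with the rank.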

Brown's statistic focuses on the largest 2/3 of the $|Y_i|$, while Noether's
statistic focuses on the largest 1/3 of the $|Y_i|$. In the example in
Section 1.2, further tinkering with $\lambda_1$ and $\lambda_2$ led to greater
insensitivity to unmeasured bias. Markowski and Hettmansperger (1982) discuss
the choice of $\lambda_1$ and $\lambda_2$ from the perspective of Pitman
efficiency. Table 6 compares the design sensitivities for several values of
$\lambda_1$ with $\lambda_2 = 2\lambda_1$.

Table 5
Design sensitivities for Brown's statistic, Noether's statistic and for two
comparable step-Wilcoxon statistics. The table refers to an additive treatment
effect that is half the standard deviation of the errors, for errors with a
Normal distribution, a logistic distribution or a t-distribution with 3 degrees
of freedom

Statistic             Normal    Logistic    t 3 df

Brown                   3.60        3.83      5.39
1/3-step Wilcoxon       3.60        3.83      5.35
Noether                 4.97        4.72      5.77
2/3-step Wilcoxon       5.20        4.80      5.64



In thinking about Table 6, several cautions are needed. First, the design
sensitivity refers to a limit as the number $I$ of pairs increases,
$I \to \infty$, so Table 6 is unlikely to offer useful guidance unless
$I\lambda_1$ is a reasonably large number. With $I = 100$ pairs and
$\lambda_1 = 1/8$, there are only 13 pairs counted in Noether's statistic, so
asymptotic theory is not likely to provide useful guidance. Second, the Pitman
efficiencies for Noether's statistic with $\lambda_1 < 1/3$ are substantially
worse than the already disappointing values shown in Table 1, so Table 6 is
only relevant when the sample size $I$ is so large that the design sensitivity
has come to dominate the Pitman efficiency, as it will do in the limit as
$I \to \infty$ because the power function tends to a step function dropping
from power 1 to power 0 at $\widetilde{\Gamma}$. There are, however, many large
observational studies, for example, Volpp et al. (2007) conducted an
observational study of 8.5 million hospital admissions. Third, the columns of
Table 6 refer to distributions that differ greatly in their tails, so an answer
that depends strongly upon which column is considered is an answer that depends
strongly on the behavior of the most extreme observations.

Table 6
Design sensitivities for the Brown-Markowski-Hettmansperger statistic,
Noether's statistic and the adaptive statistic for various values of
$\lambda_1$ with $\lambda_2 = 2\lambda_1$. The table refers to an additive
treatment effect that is half the standard deviation of the errors, for errors
with a Normal distribution, a logistic distribution or a t-distribution with 3
degrees of freedom. The largest design sensitivity in a sampling situation (or
in a column) is in bold

Statistic                          $\lambda_1$   Normal   Logistic   t 3 df

Brown-Markowski-Hettmansperger         1/3         3.60       3.83     5.39
Noether                                1/3         4.97       4.72     5.77
Adaptive                               1/3         4.97       4.72     5.77

Brown-Markowski-Hettmansperger         1/4         4.36       4.37     5.67
Noether                                1/4         5.87       5.06     5.53
Adaptive                               1/4         5.87       5.06     5.67

Brown-Markowski-Hettmansperger         1/6         5.58       4.93     5.51
Noether                                1/6         7.28       5.41     5.03
Adaptive                               1/6         7.28       5.41     5.51

Brown-Markowski-Hettmansperger         1/8         6.55       5.23     5.20
Noether                                1/8         8.40       5.59     4.64
Adaptive                               1/8         8.40       5.59     5.20

With these cautions firmly in mind, consider Table 6. In Table 6,
$\lambda_1 = 1/8$ is best for the Normal and logistic distributions and
$\lambda_1 = 1/3$ is best for the t-distribution with 3 degrees of freedom; see
Rosenbaum [(2010a), Figure 2] for a heuristic explanation of the relationship
between tail behavior, weights and sensitivity to bias. With smaller
$\lambda_1$'s, the adaptive procedure looks attractive: for $\lambda_1 = 1/8$
it uses Noether's test to advantage for Normal errors and it uses the
Brown-Markowski-Hettmansperger test for t-errors. Notably, in Table 6, the
adaptive procedure exhibits relatively stable performance as $\lambda_1$
decreases for the t-distribution, but it captures large gains for the Normal
and logistic distributions.

6.2. Are large observational studies less susceptible to unmeasured biases? 
Section 1 began with the question: are large observational studies less suscep- 
tible to unmeasured biases? The success of the adaptive procedure suggests 
that this question is incorrectly posed. An observational study is sensitive 
to biases of a certain magnitude, and the sample size is not the key element 
in determining this. However, a poor choice of test statistic — perhaps the 
Wilcoxon statistic — may lead to a sensitivity analysis that exaggerates the 
degree of sensitivity to unmeasured biases. A good choice of test statistic 
may depend upon features of the observable distributions that are unknown 
to the investigator prior to the investigation. To the extent that a large sam- 
ple size permits us to see clearly these features of observable distributions, it 
may let us adapt the statistical analysis so that a poor choice of test statistic 
does not exaggerate the degree of sensitivity to unmeasured biases. 